MBON Acoustic Indices Study

Biostats Review — Methods & Preliminary Results

2025-12-12

Research Question

Can acoustic indices predict biological community metrics in estuarine environments?

  • Location: 3 stations, May River, South Carolina
  • Period: 2021 (full year)
  • Responses: 9 community metrics
    • Fish: activity, richness, presence
    • Dolphins: echolocation, burst pulse, whistle, total activity, presence
    • Vessels: presence
  • Predictors: ~60 acoustic indices (candidates)

Data Overview

  • 13,102 observations (2-hour temporal bins)
  • 4 data sources aligned to common resolution:
    Source            Description
    Detections        Manual annotations of fish/dolphin/vessel presence
    Environment       Temperature, depth (sensor data)
    Acoustic Indices  ~60 indices across 5 categories
    SPL               Sound pressure levels
  • Index categories: Amplitude, Complexity, Diversity, Spectral, Temporal
  • Temporal structure: station / month / day / hour

Pipeline Overview

Stage 00: Data Alignment (4 sources → 2-hour bins)
    ↓
Stage 01: Index Reduction (60 → 14 via correlation/VIF)
    ↓
Stage 02-03: Response Variables + Feature Engineering
    ↓
Stage 05: GAMM modeling (mgcv::bam)

Model choice: GAMM (Generalized Additive Mixed Model)

  • Allows non-linear (smooth) relationships between predictors and response
  • Increasingly common in ecological literature for this type of study
  • A preliminary comparison against a GLMM strongly favored the GAMM (ΔAIC > 6000)

Model Specifications

GAMM (mgcv::bam)

  • Smooth terms (k=5) for indices & covariates
  • Cyclic splines for hour, day-of-year
  • Random effects: station, month
  • AR1 residual autocorrelation via bam's rho argument


Model types:

  • Negative binomial for count responses (nb() in mgcv; same quadratic mean-variance form as nbinom2 in glmmTMB)
  • Binomial for presence/absence responses

Question 1: Index Reduction

Is our approach appropriate? Should we reduce further?

Index Reduction: What We Did

Step 1: Correlation pruning

  • Removed one index from each pair with |r| > 0.6
  • Result: 60 → 17 indices
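A minimal sketch of one way to do pairwise correlation pruning at the 0.6 threshold. The greedy "keep the first, drop the later one" rule is an assumption: the slides do not say which member of a correlated pair was removed, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)


def prune_correlated(columns, threshold=0.6):
    """Greedy pruning: keep an index only if |r| <= threshold with every
    index already kept. `columns` maps index name -> list of values;
    dict order decides which member of a correlated pair survives."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Because the result depends on column order, the ordering rule (for example, preferring the more interpretable index) is worth stating in the methods. A constant or near-constant index passes this filter, since it correlates with nothing.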

Step 2: VIF screening

  • Iteratively removed indices with VIF > 2
  • Result: 17 → 14 indices
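The iterative VIF step can be sketched as follows. It uses the identity that the VIF of predictor j is the j-th diagonal element of the inverse correlation matrix. This is a standard-library illustration under stated assumptions (no constant columns, no exactly collinear columns, hypothetical function names), not the screening code used in the study.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        centered.append([d / norm for d in dev])  # assumes non-constant columns
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in centered]
            for ci in centered]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for small matrices).
    Fails with ZeroDivisionError if the matrix is exactly singular."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_screen(data, threshold=2.0):
    """Drop the highest-VIF index until every VIF <= threshold.
    `data` maps index name -> list of values; returns surviving names."""
    names = list(data)
    while len(names) > 1:
        inv = invert(correlation_matrix([data[n] for n in names]))
        vifs = [inv[i][i] for i in range(len(names))]  # VIF_j = [R^-1]_jj
        worst = max(range(len(names)), key=vifs.__getitem__)
        if vifs[worst] <= threshold:
            break
        names.pop(worst)
    return names
```

Recomputing all VIFs after each removal matters, because dropping one index changes the VIF of every index that was correlated with it.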

Outcome:

  • 14 indices retained
  • All 5 categories preserved (Amplitude, Complexity, Diversity, Spectral, Temporal)

Index Reduction: Concerns

  1. Is 14 indices too many?

  2. Is correlation + VIF the right approach?

    • Alternatives: PCA, LASSO, elastic net (others?)
  3. Model shrinkage removed 4 more

    • GAMM select=TRUE shrunk ADI, BioEnergy, EPS_KURT, MEANt to ~zero
    • Should we report 14 predictors or 10 “effective” predictors?
    • Should we formalize this as a two-stage approach? Remove the “shrunk” indices and rerun?

Index Reduction: Questions for Discussion


Q1: Is correlation + VIF standard practice, or is there a more appropriate approach?


Q2: Given that GAMM shrinkage removed 4 indices, should we adopt a two-stage approach (VIF → model-based selection)?


Q3: Is 14 predictors reasonable for ~13K observations, given that the effective sample size is lower due to temporal autocorrelation?
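To put a rough number on Q3, the sketch below applies the common AR(1) approximation n_eff ≈ n(1 − ρ)/(1 + ρ), using the fixed rho = 0.6 noted under the methodological concerns. This formula is derived for the variance of a mean in a single AR(1) series, so for a GAMM fit across three stations it is only an order-of-magnitude guide, not a result from the study.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Approximate effective n for a mean under AR(1) errors:
    n_eff ~= n * (1 - rho) / (1 + rho)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


n_obs = 13102   # observations reported in the data overview
rho = 0.6       # fixed AR1 value currently used in the GAMM
n_eff = effective_sample_size(n_obs, rho)
print(round(n_eff))        # roughly 3,300 effective observations
print(round(n_eff / 14))   # roughly 234 effective observations per index
```

Under this approximation the effective sample size is about a quarter of the nominal one, which still leaves a few hundred effective observations per candidate index.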

Question 2: Modeling Results

What are our results telling us? Any concerns?

GAMM Results: Significant Predictors

Term         EDF   p-value  Interpretation
hour_of_day  8.24  <0.001   Strong diel pattern
ACI          2.66  <0.001   Non-linear, positive
BI           2.82  <0.001   Non-linear, negative
EAS          3.08  <0.001   Non-linear
VARt         2.94  <0.001   Non-linear
depth        1.00  <0.001   Linear, negative


Shrunk away (not significant): ADI, BioEnergy, EPS_KURT, MEANt

GAMM Smooth Plots: Overview

Smooth Zoom: hour_of_day

Observations:

  • Strong diel pattern (EDF = 8.2)
  • Peak activity ~8 PM (hour 20)
  • Lowest ~10 AM (hour 10)

Validation:

  • Matches known fish calling behavior
  • Suggests the model is capturing real biology

Smooth Zoom: BI (negative relationship)

Observation:

  • Higher BI → less fish activity
  • Counterintuitive?

Possible explanations:

  • BI elevated when other sources dominate (snapping shrimp?)
  • Fish call when BI is lower
  • Seasonal confounding?

Question: Ecologically interpretable or artifact?

Smooth Zoom: VARt (non-linear)

Observation:

  • “Goldilocks” relationship
  • Fish activity peaks at intermediate VARt
  • Drops at both extremes

Implication:

  • This non-linear pattern is common in ecology
  • GAMM smooths capture it naturally

Unexpected Results

Temperature:

  • NOT significant in GAMM (p = 0.12)

Day of year:

  • NOT significant in GAMM (p = 0.18)
  • Despite visible seasonality in the data


Hypothesis: Acoustic indices absorb the seasonal/temperature signal?

Methodological Concerns

  1. 10 of 14 indices significant
    • Genuine signal or overfitting?
    • Large sample size (13K) means small effects can be significant
  2. AR1 autocorrelation
    • Currently using fixed rho = 0.6
    • Should we estimate rho from the data?
  3. Indices absorbing environmental signal?
    • Temperature and seasonality not significant
    • But indices vary with both — collinearity concern?
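On concern 2, one common way to choose rho is to fit the model without the AR1 term and take the lag-1 autocorrelation of its residuals. The sketch below shows that calculation, assuming the residuals are already grouped by station and time-ordered within each station; the function names and input layout are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of one time-ordered residual series."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("residuals are constant")
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / denom


def rho_by_station(residuals_by_station):
    """Estimate rho separately per station so lags never span two series."""
    return {station: lag1_autocorrelation(series)
            for station, series in residuals_by_station.items()}
```

If the per-station estimates differ noticeably from 0.6 or from each other, that is useful to report alongside the fixed value. Gaps in the 2-hour series would also need handling, since this sketch treats consecutive rows as consecutive bins.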

Question 3: Validation Approach

How much validation do we need for a journal paper?

Inference vs Prediction: Options

Approach              What it means                         Pros                                         Cons
Inference only        Full-data GAMM, report coefficients   Simpler, answers “are there relationships?”  Reviewers may question generalizability
Inference + light CV  Add leave-one-station-out validation  Shows relationships are robust               Slightly more work
Full predictive       Extensive CV, prediction metrics      Strongest for “applied monitoring” framing   Scope creep


Our tentative plan: Inference + light CV

  • Primary: Understanding which indices relate to community metrics
  • Supporting: Simple CV showing relationships generalize
  • Future work: Operational prediction applications
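The leave-one-station-out scheme amounts to three folds, one per station. A minimal sketch of the fold construction is below; the station labels are placeholders and model fitting is deliberately left out.

```python
def leave_one_station_out(stations):
    """Yield (held_out, train_idx, test_idx) for each distinct station.
    `stations` is the per-row station label, in row order."""
    for held_out in sorted(set(stations)):
        train_idx = [i for i, s in enumerate(stations) if s != held_out]
        test_idx = [i for i, s in enumerate(stations) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical labels; the real data has 3 stations and 13,102 rows.
rows = ["A", "A", "B", "C", "B", "C"]
for station, train_idx, test_idx in leave_one_station_out(rows):
    # Fit on train_idx, score on test_idx (model fitting not shown).
    print(station, len(train_idx), len(test_idx))
```

One design point to settle: a held-out station has no estimated station random effect, so predictions for it would need that term excluded or set to zero. With only three stations, each fold also trains on two, so fold-to-fold variation may be large.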

Summary: Questions for Discussion

  1. Index reduction: Is correlation + VIF appropriate? Should we reduce further, or is model shrinkage sufficient?

  2. Interpretation: Temperature/seasonality not significant — absorbed by indices, or a problem?

  3. Validation: Is inference + light CV sufficient for publication? What kind of CV would reviewers expect?

  4. Next steps: Expand to all 9 responses? Additional diagnostics?

Additional Context

  • Pilot mode: Results shown are for fish_activity only — will expand to all 9 responses
  • MEANt: No real variation in raw data (numerical noise ~10⁻¹⁹) — model correctly shrunk it away
  • Repository: [link] — full specs, code, and data pipeline available